Abstract

This paper proposes a memory-efficient VLSI architecture for a low-complexity video encoder based on the
three-dimensional (3-D) wavelet transform and Compressed Sensing (CS), targeting space and low-power video
applications. The majority of conventional video coding schemes are based on a hybrid model, which requires
complex operations such as transform coding (DCT), motion estimation and deblocking filtering at the encoder.
The complexity of the proposed encoder is reduced by replacing those operations with 3-D DWT and CS. The
proposed architecture uses 3-D DWT to enable scalability through the levels of wavelet decomposition and to
exploit the spatial and temporal redundancies. CS provides good error resilience and coding efficiency. In the
first stage of the encoder, 3-D DWT (lifting-based 2-D DWT in the spatial domain and Haar wavelet in the
temporal domain) is applied to each frame of the group of frames (GOF); in the second stage, the CS module
exploits the sparsity of the wavelet coefficients. A small set of linear measurements is extracted by projecting
the sparse 3-D wavelet coefficients onto a random Bernoulli matrix at the encoder. Compared with the best
existing 3-D DWT architectures, the proposed 3-D DWT architecture requires less memory and provides higher
throughput. For an NxN image, it consumes a total of only 2*(3N + 40P) words of on-chip memory for one level
of decomposition, where P is the number of parallel processing units. The proposed encoder architecture is the
first of its kind and, to the best of my knowledge, no comparable architecture has been reported. It has been
synthesized in a 90-nm CMOS process technology; the results show that it consumes 90.08 mW of power and
occupies an area of 416.799 K equivalent gates at a frequency of 158 MHz. The architecture has also been
synthesized for the Xilinx Zynq 7020 series field programmable gate array (FPGA).

Index Terms: Scalable Video Coding (SVC), Compressed Sensing (CS), 3-D wavelets, VLSI.

I. Introduction

Current video coding standards (e.g., H.264 and HEVC) [1], [2] provide good compression by using
high-complexity encoders. At the encoder, motion estimation (using block matching) is applied
between adjacent frames to exploit the temporal redundancy. Each reference and residual frame
(motion-compensated differences) is then divided into non-overlapping blocks (the block size may vary from
8x8 to 64x64 pixels), and transform coding (e.g., DCT) is applied to each block to exploit the spatial
redundancy. Motion estimation and transform coding account for nearly 70% of the total complexity of the
encoder [?]. Moreover, block-wise transform coding leads to blocking artifacts in the motion-compensated
frame, which may be reduced by a deblocking filter at the cost of further encoder complexity.
In contrast, the decoder complexity is very low: its main function is to reconstruct the video frames
from the reference frame, the motion-compensated residuals and the motion vectors.
Such schemes suit broadcasting applications, where one high-complexity encoder supports
thousands of low-complexity decoders. However, they are not suitable for applications that require
low-complexity encoders, such as mobile phones and camcorders, which call for low-complexity,
low-power and low-cost devices. A high-complexity encoder increases both the compression ratio
and the power consumption. Therefore, to increase battery life in mobile devices, a low-complexity encoder
with good coding efficiency is highly desirable.

In a mobile video broadcast network (wireless network), a video source is broadcast to multiple
receivers, which may have different channel capacities, display resolutions or computing facilities. It is
necessary to encode and transmit the video source once, while allowing any subset of the bit stream to be
successfully decoded by a receiver. To reduce the error rate in wireless broadcast networks,
error correction codes such as Reed-Solomon (RS) codes and convolutional codes have been widely used.
However, this type of channel coding is not flexible: it can correct bit errors only if the error rate is
smaller than a given threshold. Therefore, it is hard to find a single channel code suitable for
channels with different capacities. For broadcast applications, without feedback from individual
receivers, the sender can re-transmit data that are helpful to all the receivers. These requirements are
difficult and challenging for traditional channel coding design. It is therefore desirable to have an
encoder that has low complexity, good coding efficiency and error resilience, is scalable, and supports
real-time applications.

This paper introduces a new VLSI architecture for a scalable low-complexity encoder using 3-D DWT
and compressed sensing. Fig. 1(a) shows the block diagram of the low-complexity video codec (encoder and
decoder). The encoder has 3-D DWT and CS as its main functional modules, as shown in Fig. 1(b). The 3-D DWT
module provides scalability through the levels of decomposition and exploits the spatial and temporal
redundancies of the video frames. It replaces the transform coding, motion estimation and deblocking
filters of current video coding systems. The CS module utilizes the sparse nature of the wavelet
coefficients and projects them onto a random Bernoulli matrix to select the measurements at the encoder,
enabling compression; the approximate message passing algorithm is used for reconstruction
at the decoder. The CS module provides a good compression ratio and improves the error resilience. As a
result, the proposed architecture has low complexity at the encoder and marginal complexity at the
decoder.

Over the last two decades, several hardware designs have been reported for the implementation of 2-D DWT
and 3-D DWT for different applications. The majority of these designs fall into three categories,
viz. (i) convolution-based, (ii) lifting-based and (iii) B-spline-based. Most of the existing architectures
suffer from a large memory requirement, low throughput and complex control circuitry.
In general, the circuit complexity is determined by two major components, viz. the arithmetic component and
the memory component. The arithmetic component includes adders and multipliers, whereas the memory component
consists of temporal memory and transpose memory. The complexity of the arithmetic component depends entirely
on the DWT filter length, whereas the size of the memory component depends on the dimensions of the
image. As image resolutions continue to increase (HD to UHD), the image dimensions are very large
compared to the DWT filter length; as a result, the memory component accounts for the major
share of the overall complexity of a DWT architecture.

Convolution-based implementations [5]-[7] provide the outputs in less time but require a large
amount of arithmetic resources, are memory intensive and occupy a larger area. Lifting-based
implementations require less memory and less arithmetic complexity, and can be implemented in parallel.
However, they have a long critical path, and a large number of recent contributions aim to reduce it.
A general lifting-based structure [8] has a critical path of 4Tm + 8Ta, where Tm and Ta denote the
multiplier and adder delays; introducing a 4-stage pipeline cuts it down to Tm + 2Ta. In [9], Huang et al.
introduced a flipping structure that further reduces the critical path to Tm + Ta. Although this reduces the
critical path delay of lifting-based implementations, the memory efficiency still needs to be improved.
The majority of designs implement the 2-D DWT by first applying 1-D DWT row-wise and then applying 1-D DWT
column-wise, which requires a large amount of memory to store the intermediate coefficients. To reduce this
memory requirement, several DWT architectures have been proposed using line-based scanning methods
[?]-[14]. Huang et al. [10]-[?] gave brief details of a B-spline-based 2-D IDWT implementation,
discussed the memory requirements of different scan techniques and proposed an efficient overlapped
strip-based scanning to reduce the internal memory size. Several parallel architectures have been proposed for
lifting-based 2-D DWT [?]-[?]. Y. Hu et al. [20] proposed a modified strip-based scanning and a parallel
architecture for 2-D DWT, which is the most memory-efficient design among the existing 2-D DWT architectures:
it requires only 3N + 24P of on-chip memory for an NxN image with P parallel processing units (PUs).
Several lifting-based 3-D DWT architectures have been reported [21]-[26] that reduce the critical path
of the 1-D DWT architecture and decrease the memory requirement of the 3-D architecture. Among
the best existing 3-D DWT designs, Darji et al. [26] produced the best results by reducing the memory
requirement and providing a throughput of 4 results/cycle. Still, the design requires 4N^2 + 10N of on-chip memory.
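
To give a feel for the scale of these memory figures, the following minimal sketch evaluates the three on-chip memory expressions quoted in this paper for a 512x512 frame and P = 2. The function names are hypothetical, and the sketch assumes that all three expressions count memory in the same unit (words).

```python
def hu_2d_memory(n: int, p: int) -> int:
    """On-chip memory quoted for the 2-D DWT design of Hu et al.: 3N + 24P."""
    return 3 * n + 24 * p


def darji_3d_memory(n: int) -> int:
    """On-chip memory quoted for the 3-D DWT design of Darji et al.: 4N^2 + 10N."""
    return 4 * n * n + 10 * n


def proposed_3d_memory(n: int, p: int) -> int:
    """On-chip memory quoted for the proposed 3-D DWT: 2 * (3N + 40P)."""
    return 2 * (3 * n + 40 * p)


if __name__ == "__main__":
    n, p = 512, 2
    print("Hu et al. 2-D DWT   :", hu_2d_memory(n, p))
    print("Darji et al. 3-D DWT:", darji_3d_memory(n))
    print("Proposed 3-D DWT    :", proposed_3d_memory(n, p))
```

The quadratic term dominates the 4N^2 + 10N expression, whereas the other two expressions grow only linearly with N.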

Based on the ideas of compressed sensing (CS) [27]-[29], several new video codecs [30]-[35] have
been proposed in the last few years. Wakin et al. [30] introduced compressive imaging and video
encoding through a single-pixel camera. From their results, Wakin et al. established that the 3-D wavelet
transform is a better choice for video than the 2-D (two-dimensional) wavelet transform. Y. Hou
and F. Liu [31] proposed a low-complexity system in which sparsity is extracted from the residuals of
successive non-key frames and CS is applied to those frames. Key frames are fully sampled, resulting in
an increased bit-rate. Moreover, performing motion estimation and compensation while predicting the non-key
frames increases the encoder complexity. S. Xiang and Lin Cai [32] proposed a CS-based scalable video
coding scheme in which the base layer is composed of a small set of DCT coefficients while the enhancement
layer is composed of compressed sensed measurements. It uses DCT for I frames and undecimated DWT
(UDWT) for CS measurements, which increases the complexity at the decoder to a great extent. Jiang et
al. [?] proposed CS-based scalable video coding using the total variation of the coefficients of the temporal DCT.
Scalability is enabled by multi-resolution measurements, while the video signal is reconstructed by total
variation minimization by augmented Lagrangian and alternating direction algorithms (TVAL3) at the
decoder. However, this increases the decoder complexity, making hardware implementation quite difficult.
J. Ma et al. [35] introduced fast and simple on-line encoding and decoding using a forward and
backward splitting algorithm. Though the encoder complexity is low, scalability is not achieved and the decoder
complexity is very high. Most of the recently proposed video codecs [30]-[35] assume that
uniform sparsity is available for all the video frames, and transmit a fixed number of measurements
to the decoder for every frame. Depending on the content of the video frame, the sparsity may
change, and a fixed number of measurements may cause an increase in bit-rate (decrease in compression
ratio).

Figure 1. (a) Block diagram for CS based Scalable Video Codec (b) Detailed block diagram of Encoder

This paper introduces a new compressed-sensing-based low-complexity encoder architecture using
3-D DWT. The proposed method uses a random Bernoulli sequence at the encoder for selecting the
measurements and the approximate message passing algorithm for reconstruction at the decoder. The major
contributions of the present work are as follows. Firstly, the proposed framework revises the
MCTF-based SVC model [36] by introducing compressed sensing concepts to increase the compression
ratio and to reduce the complexity. As a result, it ensures low complexity at the
encoder and marginal complexity at the decoder. Secondly, we propose a new architecture for 3-D DWT,
which requires only 2*(3N + 40P) words of on-chip memory with a throughput of 8 results/cycle.
Thirdly, we propose an efficient architecture for the compressed sensing module.

The paper is organized as follows. The fundamentals of 3-D DWT and compressed sensing are presented
in Section II. Detailed descriptions of the proposed architectures for the 3-D DWT and compressed sensing
modules are provided in Sections III and IV, respectively. Results and comparisons are given in Section V.
Finally, concluding remarks are given in Section VI.
Figure 2. 3-D wavelet by combining 2-D spatial and 1-D temporal (panel labels: L frames, H frames)


II. Theoretical Framework 

This section presents the theoretical background of wavelets and compressed sensing. 3-D DWT is
used to exploit the spatial and temporal redundancies of the video, thereby eliminating complex
operations such as motion estimation (ME), motion compensation (MC) and deblocking filtering. Compressed
sensing is used to provide error resilience and coding efficiency.


A. Discrete Wavelet Transform 

Lifting-based wavelet transforms are designed using a series of matrix decompositions, as specified by
Daubechies and Sweldens in [8]. By applying flipping [9] to the lifting scheme, the multipliers in the
longest delay path are eliminated, resulting in a shorter critical path. The original data to which the DWT is
applied is denoted by X[n]; the 1-D DWT outputs are the detail coefficients H[n] and the approximation
coefficients L[n]. For an image (2-D), this process is performed on the rows and on the columns. Eqns. (1)-(6)
are the design equations for the flipping-based lifting (9/7) 1-D DWT [37], and the same equations are
used to implement the proposed row processor (1-D DWT) and column processor (1-D DWT).
Here $a' = 1/\alpha$, $b' = 1/(\alpha\beta)$, $c' = 1/(\beta\gamma)$, $d' = 1/(\gamma\delta)$,
$K_0 = \alpha\beta\gamma/\zeta$ and $K_1 = \alpha\beta\gamma\delta\zeta$ [?]. The lifting
step coefficients $\alpha$, $\beta$, $\gamma$, $\delta$ and the scaling coefficient $\zeta$ are constants with
values $\alpha = -1.586134342$, $\beta = -0.052980118$, $\gamma = 0.8829110762$, $\delta = 0.4435068522$ and
$\zeta = 1.149604398$.

Lifting based wavelets are always memory efficient and easy to implement in hardware. The lifting 
scheme consists of three steps to decompose the samples, namely, splitting, predicting (eqn. (1) and (3)), 
and updating (eqn. (2) and (4)). 
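
The split, predict and update steps can be illustrated with a minimal software sketch. It implements the conventional (non-flipped) 9/7 lifting scheme with the coefficient values quoted above, not the flipped hardware equations of this paper. The function names, the whole-sample symmetric boundary handling, the even-length restriction and the scaling convention (low band multiplied by the scaling constant, high band divided by it) are assumptions made for illustration. The dictionary `FLIPPED` takes the primed constants to be reciprocals of products of the lifting coefficients, an assumption that reproduces the "original" values listed in Table I.

```python
import numpy as np

# Lifting and scaling constants quoted in the text.
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.052980118, 0.8829110762, 0.4435068522
ZETA = 1.149604398

# Assumed relation between the primed (flipped) constants and the lifting constants.
FLIPPED = {
    "a'": 1.0 / ALPHA,
    "b'": 1.0 / (ALPHA * BETA),
    "c'": 1.0 / (BETA * GAMMA),
    "d'": 1.0 / (GAMMA * DELTA),
}


def _next(a):
    """a[n + 1], mirrored at the right edge (symmetric extension)."""
    return np.concatenate([a[1:], a[-1:]])


def _prev(a):
    """a[n - 1], mirrored at the left edge (symmetric extension)."""
    return np.concatenate([a[:1], a[:-1]])


def dwt97_forward(x):
    """One level of conventional 9/7 lifting on an even-length 1-D signal."""
    x = np.asarray(x, dtype=float)
    assert x.size >= 2 and x.size % 2 == 0, "even-length input assumed"
    s, d = x[0::2].copy(), x[1::2].copy()   # split into even / odd samples
    d += ALPHA * (s + _next(s))             # predict 1
    s += BETA * (d + _prev(d))              # update 1
    d += GAMMA * (s + _next(s))             # predict 2
    s += DELTA * (d + _prev(d))             # update 2
    return ZETA * s, d / ZETA               # scaled low (L) and high (H) bands


def dwt97_inverse(low, high):
    """Undo dwt97_forward by reversing every lifting step."""
    s = np.asarray(low, dtype=float) / ZETA
    d = np.asarray(high, dtype=float) * ZETA
    s -= DELTA * (d + _prev(d))
    d -= GAMMA * (s + _next(s))
    s -= BETA * (d + _prev(d))
    d -= ALPHA * (s + _next(s))
    x = np.empty(s.size + d.size)
    x[0::2], x[1::2] = s, d
    return x
```

Because every lifting step is undone by subtracting exactly what was added, the inverse reconstructs the input up to floating-point rounding regardless of the boundary rule chosen.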

The Haar wavelet transform is orthogonal, simple to construct and fast to compute. For these reasons,
the proposed architecture uses the Haar wavelet to perform the 1-D
DWT in the temporal direction (between two adjacent frames). Sweldens et al. [45] developed a lifting-based
Haar wavelet. The lifting scheme for the Haar wavelet transform is shown in eqn. (7).

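$$\begin{bmatrix} L \\ H \end{bmatrix} = \begin{bmatrix} 1 & S(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -P(z) & 1 \end{bmatrix} \begin{bmatrix} X_e \\ X_o \end{bmatrix} \qquad (7)$$

$$L = \tfrac{1}{2}(X_e + X_o), \qquad H = X_o - X_e \qquad (8)$$

with $X_e$ and $X_o$ denoting co-located coefficients of the two adjacent frames.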

Eqn. (8) is obtained by substituting the predict value P(z) = 1 and the update value S(z) = 1/2 in eqn. (7),
and is used to develop the temporal processor that applies the 1-D DWT in the temporal direction (3rd dimension).
Here L and H are the low- and high-frequency coefficients, respectively.
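
A minimal sketch of the temporal Haar lifting step, assuming a predict value of 1 and an update value of 1/2 as stated above. The function names and the sign convention (high band equals second frame minus first frame) are assumptions for illustration; the inputs stand for co-located coefficients of two adjacent frames.

```python
import numpy as np


def haar_temporal_forward(frame_a, frame_b):
    """Lifting-based Haar step between two adjacent frames (or sub-bands)."""
    a = np.asarray(frame_a, dtype=float)
    b = np.asarray(frame_b, dtype=float)
    high = b - a              # predict step with P = 1
    low = a + 0.5 * high      # update step with S = 1/2, i.e. (a + b) / 2
    return low, high


def haar_temporal_inverse(low, high):
    """Recover both frames by undoing the update and then the predict step."""
    a = np.asarray(low, dtype=float) - 0.5 * np.asarray(high, dtype=float)
    b = np.asarray(high, dtype=float) + a
    return a, b
```

Only one subtraction, one addition and one halving are needed per coefficient pair, which is consistent with the text's preference for the Haar wavelet in the temporal direction.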

The process shown in Fig. 2 represents one level of decomposition in the spatial and temporal domains.
Among all the sub-bands, only the LLL sub-band (the LL band of the L-frames) is fully sampled and transmitted
without applying any CS technique, because it represents the image at low resolution (the base layer in
the SVC domain) and is not sparse. All the other sub-bands (3-D wavelet coefficients) exhibit
approximate sparsity (values near zero), and hard thresholding is applied to them (a coefficient is set to
zero if its value is less than the threshold). After this step, conventional encoders use EZW coding to
encode the wavelet coefficients, which is complex to implement in hardware. In the proposed framework, EZW
coding is replaced by CS, which exploits the sparsity-preserving nature of the random Bernoulli matrix by
projecting the wavelet coefficients onto it. The DWT version of each frame consists of four sub-bands.
The LL sub-bands of the L-frames have large wavelet coefficients. The remaining three sub-bands of the
L-frames and the four sub-bands of the H-frames exhibit sparsity, and compressed sensing is applied to them.
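
The hard-thresholding step can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the comparison is made on the coefficient magnitude, which the text does not state explicitly.

```python
import numpy as np


def hard_threshold(coeffs, threshold):
    """Set every coefficient whose magnitude is below the threshold to zero."""
    out = np.asarray(coeffs, dtype=float).copy()
    out[np.abs(out) < threshold] = 0.0
    return out


def sparsity(coeffs):
    """Number of non-zero coefficients (the value K used by the CS stage)."""
    return int(np.count_nonzero(coeffs))
```

For example, `hard_threshold([0.2, -3.0, 0.9, 5.0, -0.1], 1.0)` keeps only the two coefficients with magnitude of at least 1, so `sparsity` of the result is 2.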

B. Compressed Sensing 

Compressed sensing is an innovative scheme that enables sampling below the Nyquist rate with no (or only
a small) drop in reconstruction quality. The basic principle behind compressed sensing is to
exploit the sparsity of the signal in some domain. In the proposed work, CS is applied in the wavelet
domain.

Let $x = \{x[1], \ldots, x[N]\}$ be a set of N real, discrete-time samples. Let s be the representation of x in
the $\Psi$ (transform) domain, that is:

$$x = \Psi s = \sum_{i=1}^{N} \psi_i s_i \qquad (9)$$

where $s = [s_1, \ldots, s_N]$ is the vector of weighting coefficients, $s_i = \langle x, \psi_i \rangle$, and
$\Psi = [\psi_1 | \psi_2 | \ldots | \psi_N]$ is an $N \times N$ basis matrix. Assume that the vector x is
K-sparse (K coefficients of s are non-zero) in the domain $\Psi$, and $K \ll N$. To obtain the sparse
representation of x, conventional transform coding is applied to the whole signal x (all N samples) using
$s = \Psi^T x$, which gives N transform coefficients. Among the N coefficients,
N - K or more are discarded because they carry negligible energy, and the remaining ones are
encoded. The basic idea of CS is to remove this "sampling redundancy" by taking only M samples of the
signal, where $K < M \ll N$. Let y be an M-element measurement vector given by $y = \Phi x$ or
$y = \Phi \Psi s$, with $y \in \mathbb{R}^M$ and $\Phi \in \mathbb{R}^{M \times N}$. The measurements are
non-adaptive linear projections of the signal $x \in \mathbb{R}^N$, with typically $M \ll N$.

Recovering the original signal x means solving an under-determined linear system, which usually has no
unique solution. However, the signal x can be recovered losslessly from M > K measurements if
the measurement matrix $\Phi$ is designed so that it preserves the geometry of the sparse
signals and each of its M x K sub-matrices has full rank. This property is called the Restricted Isometry
Property (RIP); mathematically, it ensures that $\|x_1 - x_2\|_2 \approx \|\Phi x_1 - \Phi x_2\|_2$, where
$\|y\|_2$ denotes the $\ell_2$-norm of the vector y. It has been observed that random matrices drawn from
independent and identically distributed (i.i.d.) Gaussian or Bernoulli distributions satisfy the RIP with high
probability.

The problem of signal recovery from CS measurements has been well studied in recent years, and
a host of algorithms have been proposed, such as Orthogonal Matching Pursuit (OMP) [38]-[40],
Iterative Hard-Thresholding (IHT) [41] and Iterative Soft-Thresholding (IST) [42]. Although the recently
introduced Approximate Message Passing (AMP) algorithm [43] has a structure similar to IHT and
IST, it exhibits faster convergence. The literature [43], [44] shows that AMP performs excellently for many
deterministic and highly structured matrices.

III. Proposed architecture for 3-D DWT

The proposed architecture for 3-D DWT, comprising two parallel spatial processors (2-D DWT)
and four temporal processors (1-D DWT), is depicted in Fig. 1(b). After applying 2-D DWT to two
consecutive frames, each spatial processor (SP) produces four sub-bands, viz. LL, HL, LH and HH, which are
fed to the inputs of the four temporal processors (TPs) to perform the temporal transform. The output of the
TPs is a low-frequency frame (L-frame) and a high-frequency frame (H-frame). Architectural details of
the spatial and temporal processors are discussed in the following sections.

A. Architecture for Spatial Processor 

In this section, we propose a new parallel and memory-efficient lifting-based 2-D DWT architecture,
denoted the spatial processor (SP), which consists of row and column processors. The proposed SP is a
revised version of the architecture developed by Y. Hu et al. [20]. It uses
strip-based scanning [20] to enable a trade-off between external and internal memory. To
reduce the critical path in each stage, the flipping model [9], [37] is used to develop the processing element
(PE). Each PE is built with shift-and-add techniques in place of a multiplier. The lifting-based
(9/7) 1-D DWT is performed by the processing unit (PU). To reduce the critical path delay (CPD), the PU is
designed with five pipeline stages and the multipliers are replaced with
shift-and-add circuits. This modified PU reduces the CPD to 2Ta (two adder delays). Fig. 3(a) shows the
data flow graph (DFG) of the proposed PU and Fig. 3(b) depicts its internal architecture.
The number of inputs to the spatial processor is 2P+1, which is also the width
of the strip, where P is the number of parallel processing units (PUs) in the row processor as well as in the
column processor. We have designed the proposed architecture with two parallel processing units (P = 2).
The same structure can be extended to P = 4, 8, 16 or 32, depending on the external bandwidth. As soon as the
row processor produces intermediate results, the column processor starts to process them.
The row processor takes 5 clock cycles to produce the intermediate results, the column
processor takes 5 more clock cycles to give the 2-D DWT output, and the temporal processor takes 2 more
clock cycles after the 2-D DWT results are available to produce the 3-D DWT output. In summary, the proposed 2-D
DWT and 3-D DWT architectures have constant latencies of 10 and 12 clock cycles, respectively, regardless
of the image size N and the number of parallel PUs (P). Details of the row processor and column processor are
given in the following sub-sections.

Figure 3. (a) Data Flow Graph of modified 1-D DWT architecture (b) Structure of Processing Unit

1) Row Processor (RP): Let X be an image of size NxN, extended by one column using
symmetric extension, so that the image size becomes Nx(N+1). Refer to [20] for the structure of the strip-based
scanning method. The proposed architecture first performs the DWT row-wise through the row processor (RP)
and then performs the column DWT through the column processor (CP). Fig. 4(a) shows the generalized structure of
a row processor with P PUs; P = 2 is used in our design. In the
first clock cycle, the RP gets the pixels X(0,0) to X(0,2P) simultaneously. In the second clock cycle, it
gets the pixels of the next row, i.e., X(1,0) to X(1,2P), and the same procedure continues for each clock
cycle until it reaches the bottom row, i.e., X(N-1,0) to X(N-1,2P). It then moves to the next strip, where the
RP gets the pixels X(0,2P) to X(0,4P), and continues this procedure over the entire image. Each PU consists of
five pipeline stages, and each pipeline stage is processed by one processing element (PE), as depicted in
Fig. 3(b). The first stage (shift PE) provides the partial results required by the 2nd stage (PE_alpha);
likewise, the processing elements PE_alpha to PE_delta (2nd to 5th stage) provide partial results along with
their original outputs. For example, in PU-1, PE_alpha provides the output corresponding to eqn. (1), H1[n], and
along with H1[n] it also provides the partial output X'[2n] required by PE_beta. The structure of the
PEs is given in Fig. 3(b), which shows that multiplication is replaced with the shift-and-add technique. The
original multiplication factors and the values realized by the shift-and-add circuits are listed in Table I,
which shows that the difference between the original and adopted values is extremely small. The maximum CPD of
these PEs is 2Ta. The outputs H1[n+P-1], L1[n+P-1] and H2[n+P-1], corresponding to PE_alpha,
PE_beta and PE_gama of the last PU, are saved in the memories Memory_alpha, Memory_beta
and Memory_gama, respectively. These stored outputs serve as inputs for the subsequent columns of the
same row. For an NxN image, the number of rows is N, so the size of each memory is Nx1 words and the
total row memory needed to store these outputs is 3N words. The output of each PU undergoes
scaling to produce the outputs H and L. These outputs are fed to the transposing unit,
which has P transpose registers (one for each PU). Fig. 5(a) shows the structure
of the transpose register, which delivers two H and two L data alternately to the column processor.

Figure 4. (a) Row Processor (b) Column Processor

Figure 5. (a) Transpose Register (Ref: [20]) (b) Re-arrange Unit
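
The pixel fetch order described for the row processor can be sketched as a generator. This is an illustration only: the function name is hypothetical, rows are assumed to be indexed 0 to N-1, N is assumed to be a multiple of 2P, and the column indices refer to the image after the one-column symmetric extension (columns 0 to N).

```python
from typing import Iterator, List, Tuple


def strip_scan(n: int, p: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield, for each clock cycle, the 2P+1 (row, column) coordinates read by the RP."""
    assert n % (2 * p) == 0, "N is assumed to be a multiple of 2P"
    step = 2 * p                          # each new strip starts 2P columns later
    for start in range(0, n, step):       # adjacent strips share one column
        for row in range(n):              # scan the strip from top to bottom
            yield [(row, start + c) for c in range(step + 1)]
```

With N = 8 and P = 2, the first strip covers columns 0 to 4 and the second strip covers columns 4 to 8, so the whole frame is read in (N / 2P) * N = 16 clock cycles.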

2) Column Processor (CP): The structure of the column processor (CP) is shown in Fig. 4(b). To
match the RP throughput, the CP is also designed with two PUs in our architecture. Each
transpose register produces a pair of H and L values in alternating order, which are fed to the inputs of one PU
of the CP. The partial results produced are consumed by the next PE after two clock cycles. Hence,
shift registers of length two are needed within the CP between pipeline stages to cache the partial
results (except between the 1st and 2nd pipeline stages). At the output of the CP, the four sub-bands are generated
in an interleaved pattern, i.e., (HL,HH), (LL,LH), (HL,HH), (LL,LH), and so on. The outputs of the CP are
fed to the re-arrange unit. Fig. 5(b) shows the architecture of the re-arrange unit, which provides the outputs
in sub-band order, i.e., LL, LH, HL and HH simultaneously, using P registers and 2P multiplexers. For
multilevel decomposition, the same DWT core can be used in a folded architecture with an external frame
buffer for the LL sub-band coefficients.
Table I
Original and adopted values for multiplication

PE          Original multiplier value    Multiplier value through shift and add
PE_alpha    a' = -0.6305                 a' = -0.6328
PE_beta     b' = 11.90                   b' = 12
PE_gama     c' = -21.378                 c' = -21.375
PE_delta    d' = 2.55                    d' = 2.565
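
To illustrate how a constant multiplication can be replaced by shifts and additions, the sketch below uses power-of-two decompositions that reproduce the adopted values of a', b' and c' in Table I. These decompositions are assumptions chosen to match the table, not the circuits of Fig. 3(b); d' is omitted because no decomposition for it is given here. The names are hypothetical.

```python
# Each entry: (sign, exponents), meaning sign * sum(2**e for e in exponents).
SHIFT_ADD_TERMS = {
    "a'": (-1, [-1, -3, -7]),        # -(1/2 + 1/8 + 1/128) = -0.6328125
    "b'": (+1, [3, 2]),              # 8 + 4 = 12
    "c'": (-1, [4, 2, 0, -2, -3]),   # -(16 + 4 + 1 + 1/4 + 1/8) = -21.375
}


def effective_multiplier(sign, exponents):
    """Real-valued constant realized by the shift-and-add network."""
    return sign * sum(2.0 ** e for e in exponents)


def shift_add_multiply(x, sign, exponents):
    """Multiply the integer x by the constant using only shifts and additions.

    Right shifts truncate, so the result is exact only when x carries enough
    low-order zero bits (for example, a fixed-point value).
    """
    acc = 0
    for e in exponents:
        acc += (x << e) if e >= 0 else (x >> -e)
    return sign * acc
```

For instance, `shift_add_multiply(128, *SHIFT_ADD_TERMS["a'"])` adds 64, 16 and 1 and negates the sum, which equals 128 multiplied by -0.6328125.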


B. Architecture for Temporal Processor (TP) 

Eqn. (8) shows that the Haar wavelet transform depends on the values of two adjacent pixels. As soon as the
spatial processors provide the 2-D DWT results, the temporal processors start processing them
and produce the 3-D DWT results. Fig. 1(b) shows that no temporal buffer is required,
because the sub-band coefficients of the two spatial processors are directly connected to the
four temporal processors. However, each TP is designed with 2 pipeline stages and requires 8 pipeline
registers. The same frequency sub-band of the two spatial processors is fed to each temporal
processor, i.e., the LL, HL, LH and HH sub-bands of spatial processors 1 and 2 are given as inputs to
temporal processors 1, 2, 3 and 4, respectively. Each temporal processor applies the 1-D Haar wavelet to the
sub-band coefficients and provides the low-frequency and high-frequency sub-bands as output. Combining
the low-frequency and high-frequency sub-bands of all the temporal processors gives the 3-D
DWT output in the form of an L-frame and an H-frame (2-D DWT by the spatial processors and 1-D DWT by the
temporal processors).


IV. Architecture for Compressed Sensing Module

The proposed 3-D DWT module works simultaneously on two video frames of size NxN and provides
eight 3-D DWT sub-bands as its output. As shown in Fig. 1(b), CS is applied to all the sub-bands of the 3-D
DWT output except the LLL band (the LL band of the L-frame), and each sub-band is connected to one CS module.
For one level of decomposition, the size of each sub-band is N/2xN/2 (half of the original frame in each
dimension). The main function of the CS module is to calculate the measurement vector y from $\Phi$ and x using
the CS equation $y = \Phi x$, where x is the input vector (on which CS is to be computed). The size of x is
P*N/2 (N/2 is the height of a single column in a sub-band), because the proposed 3-D DWT works simultaneously
on P columns owing to the P PUs in the spatial processor. The proposed architecture has been
designed with P = 2, so in each clock cycle the 3-D DWT module provides coefficients of alternate columns
for each sub-band. With P = 2, the size of x is [2*N/2]x1 = Nx1. $\Phi$ is the randomly
generated Bernoulli matrix of size MxN ($M \geq cK\log(N/K)$ for some small constant c), and
K is the $\ell_0$-norm of the input vector x. We tested different video sequences of size 512x512 and
1024x1024 with different threshold values (a wavelet coefficient smaller than the threshold is considered
zero) and observed that K is not more than N/8 for a given x of size
Nx1. Based on these observations, M has been fixed to N/4.

Figure 6. Internal architecture of CS module

Fig. 6 shows the internal architecture of the CS module. The proposed CS-based encoder has
seven CS modules, one for each sub-band except the LLL sub-band. The seven CS modules have the same
structure and work simultaneously. A single Bernoulli matrix is used for all seven CS modules;
it is stored in a ROM denoted Bern_mat. The size of Bern_mat is MxN: each location holds
M bits representing one entire column, and the number of locations is N. The Bernoulli matrix has been
generated using the 'binornd' function in Matlab ($\Phi$ = binornd(1,0.5,M,N)), with equal probability
for 0 and 1 and size MxN. Here bit '0' represents the value '+1' and bit '1' represents '-1'. The generated
Bernoulli matrix is loaded into the Bern_mat (ROM) locations and is used by all the CS modules. As
shown in Fig. 6, the input of a CS module is data_in, which is a sub-band output of the 3-D DWT. In every clock
cycle one 15-bit data_in arrives (alternate columns in successive clock cycles). In $y = \Phi x$, y is a column
matrix of size Mx1, represented as $y = [y_0, y_1, y_2, \ldots, y_{M-1}]^T$ with

$$y_i = \sum_{k=0}^{N-1} \Phi_{ik} x_k,$$

which can also be calculated iteratively: at every (n+1)th clock, $y_i(n+1) = y_i(n) + \Phi_{ik} x_k(n+1)$.
This operation requires N clocks to complete, because k runs from 0 to N-1.

The proposed architecture uses M adders, one for each individual measurement $y_i$. One input of each
adder is $\Phi_{ik} x_k$, which is the output of a multiplexer. The stored bit $\Phi_{ik}$ is either 0 or 1:
if $\Phi_{ik}$ is 0, $x_k$ is multiplied by +1 ($\Phi_{ik} x_k$ = data_in); otherwise it is multiplied by -1
($\Phi_{ik} x_k$ = 2's complement of data_in). This is done by connecting $\Phi_{ik}$ to the selection line of
the multiplexer, whose first and second inputs are $x_k$ and $-x_k$ respectively. The second input of the
adder is the partial result of $y_i$ from the previous clock. The proposed architecture for the CS module
uses two registers to store the M measurements (y), namely Y_msr1 and Y_msr2, each of capacity M*16 bits
(16 bits for each measurement). Y_msr1 stores the partial results of $y_i$ during clocks 0 to N-1. Just
after N clocks have completed, the measurements are ready in Y_msr1; the control circuit then transfers
the Y_msr1 data to Y_msr2 and clears Y_msr1 for the next set of measurements. This procedure is repeated
for all the columns of the sub-band; at the same time, the calculated measurements $y_i$, each 16 bits wide,
are sent to the output (Y_out) from Y_msr2 by shifting out 16 bits on each clock. The same procedure is
followed for all seven sub-bands. Each measurement matrix y is sent to the entropy coding (Golomb-Rice
coding) block, and the coded bit streams are transmitted through the channel. The LLL sub-band is coded
directly by the entropy coding block and then transmitted through the channel as the base layer. Entropy
coding is outside the scope of this paper and is not discussed further.
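
The datapath described above (multiplexer, adder, and the two measurement registers) can be mimicked by
a short behavioural model. The sketch below is an assumption-laden illustration in Python, not the
authors' RTL: the class name CSModule and its methods clock and read_measurements are hypothetical, the
15-bit input is assumed to be sign-extended to 16 bits, and the serial shifting of Y_out (16 bits per
clock) is collapsed into a single read for brevity.

```python
MASK16 = 0xFFFF  # registers are modelled as 16-bit two's complement words

def twos_complement16(v):
    """Negate a 16-bit word, as the second multiplexer input (-x_k) does."""
    return (-v) & MASK16

def to_signed16(v):
    """Interpret a 16-bit word as a signed integer."""
    return v - 0x10000 if v & 0x8000 else v

class CSModule:
    """Behavioural model of one CS module (hypothetical names throughout)."""

    def __init__(self, bits):
        self.bits = bits                # Bern_mat contents: M rows of N bits
        self.m = len(bits)
        self.n = len(bits[0])
        self.y_msr1 = [0] * self.m      # partial sums during clocks 0..N-1
        self.y_msr2 = [0] * self.m      # completed measurements awaiting output
        self.k = 0                      # index of the current input sample

    def clock(self, data_in):
        """Consume one sample; return True when Y_msr2 has just been loaded."""
        x = data_in & MASK16
        for i in range(self.m):
            # The stored bit selects x (bit 0) or its 2's complement (bit 1).
            mux_out = x if self.bits[i][self.k] == 0 else twos_complement16(x)
            self.y_msr1[i] = (self.y_msr1[i] + mux_out) & MASK16
        self.k += 1
        if self.k == self.n:
            self.y_msr2 = self.y_msr1   # transfer the finished measurements
            self.y_msr1 = [0] * self.m  # clear for the next column
            self.k = 0
            return True
        return False

    def read_measurements(self):
        """Return the M measurements held in Y_msr2 as signed integers."""
        return [to_signed16(v) for v in self.y_msr2]

if __name__ == "__main__":
    cs = CSModule([[0, 1, 0], [1, 1, 0]])
    for sample in (2, 3, 5):
        done = cs.clock(sample)
    print(done, cs.read_measurements())
```

Keeping two registers lets accumulation of the next column start immediately in Y_msr1 while the
previous measurements are still being read out of Y_msr2.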

V. Results and Performance Comparison 

A. Simulation Results 

The proposed encoder has been simulated in Matlab, and its functionality has been verified on the
cyclone (downloaded from the NASA website) and clock video sequences of 512x512 resolution, and on the
viplane and foreman video sequences of 256x256 resolution. After applying the 3-D DWT, all the HL,
LH and HH sub-bands of the L-frames and the LL, HL, LH and HH sub-bands of the H-frames are sent to CS.
After applying CS to the 3-D DWT coefficients, the measurements are passed through the entropy coder
(Golomb-Rice coding + run-length encoding). The percentage of measurements is calculated before entropy
coding. The compression ratio is the ratio of the total number of bits in the input frame to the number
of bits after entropy coding. Table II shows that the performance of the proposed framework competes
with the existing IB-MCTF [36] and H.264. The performance of the proposed encoder and decoder, in terms
of compression ratio and PSNR, for the clock, cyclone and viplane video sequences is reported for
level 1 to level 3 in Table III.
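
The two figures of merit used in the tables can be written down as a short sketch. The compression
ratio follows the definition given above; the PSNR function uses the standard definition for 8-bit
samples, which is assumed here because the text does not spell it out. The function names and the
8 bits-per-pixel default are illustrative assumptions.

```python
import math

def compression_ratio(width, height, coded_bits, bits_per_pixel=8):
    """Bits in the input frame divided by bits after entropy coding.
    bits_per_pixel=8 is an assumption; the text does not state the pixel depth."""
    return (width * height * bits_per_pixel) / coded_bits

def psnr(original, reconstructed, peak=255):
    """Standard PSNR in dB between two equally sized sequences of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

if __name__ == "__main__":
    # A 512x512 frame coded into one quarter of its raw size gives a ratio of 4.
    print(compression_ratio(512, 512, 512 * 512 * 8 // 4))
```

Under the same 8 bits-per-pixel assumption, a compression ratio of 24.24 on a 512x512 frame would
correspond to roughly 86.5 kbit per coded frame.
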
Table II

Performance of Proposed Framework with IB-MCTF and H.264

CODEC           Video      Compression Ratio   PSNR (dB)
Proposed        clock      24.24               44.01
                cyclone    16.85               34.2
                viplane    20.96               37.5
IB-MCTF [36]    clock      7.33                46.2
                cyclone    5.08                40.6
                viplane    5.28                47.33
H.264           clock      62.33               42.65
                cyclone    22.1                38.4
                viplane    37.8                40.57

Table III

Performance of the proposed framework for different video sequences and different levels

Video clip                        Level   PSNR (dB)   Compression Ratio   % of measurements before entropy coding
Clock 512x512 (Slow motion)       1       44          24.24               34.99
                                  2       33.2        41.67               23.82
                                  3       30.12       53.23               20.52
Cyclone 512x512 (High motion)     1       34          16.85               43.7
                                  2       29          20.56               38.61
                                  3       25.5        23.3                36.6
Viplane 256x256 (Medium motion)   1       37.5        20.96               32.7
                                  2       31.5        35.63               23.5
                                  3       28          65.54               18.12


B. Synthesis Results 

The proposed architecture for the CS-based low-complexity video encoder has been described in Verilog HDL.
Simulation results have been verified using the Xilinx ISE simulator. We have simulated a Matlab model
equivalent to the proposed CS-based low-complexity video encoder architecture and verified the 3-D
DWT coefficients and CS measurements. The RTL simulation results have been found to match the Matlab
simulation results exactly. The Verilog RTL code is synthesised using the Xilinx ISE 14.2 tool and mapped
to a Xilinx programmable device (FPGA) 7z020clg484 (Zynq board) with a speed grade of -3. Table IV shows
the device utilisation summary of the proposed architecture, which operates at a maximum frequency of
265 MHz. The proposed architecture has also been synthesized using Synopsys Design Compiler with a
90-nm CMOS standard cell library. Synthesis results of the proposed encoder are provided in Table V:
it consumes 90.08 mW of power and occupies an area equivalent to 416.799 K equivalent gates at a
frequency of 158 MHz.
Table IV

Device utilization summary of the proposed Encoder

Logic utilized                      Used     Available   Utilization (%)
Slice Registers                     15917    106400      14%
Number of Slice LUTs                47303    53200       88%
Number of fully used LUT-FF pairs   15523    47697       32%
Number of Block RAM                 3        140         2%


Table V

Synthesis Results for Proposed Encoder

Combinational Area        1072673 µm^2
Non Combinational Area    915778 µm^2
Total Cell Area           1988451 µm^2
Interconnect area         316449 µm^2
Operating Voltage         1.2 V
Total Dynamic Power       80.17 mW
Cell Leakage Power        9.90 mW


C. Comparison 

Table VI shows the comparison of the proposed 3-D DWT architecture with existing 3-D DWT architectures.
The proposed design has a lower memory requirement, higher throughput, less computation time and
minimal latency compared to [22], [23], [24], and [26]. Though the proposed 3-D DWT architecture has a
small disadvantage in area and frequency when compared to [24], it has a great advantage in all the
remaining aspects.

Table VII gives the comparison of synthesis results between the proposed 3-D DWT architecture and
[26]. The proposed one appears to occupy more cell area, but its figure includes the total on-chip
memory, whereas in [26] the on-chip memory is not included. The power consumption of the proposed 3-D
DWT architecture is much lower than that of [26].
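
To give a feel for the memory expressions listed in Table VI, the sketch below evaluates them for a
512x512 frame. It is only an illustrative calculation: the function name is hypothetical, and P = 2 is
an assumed value for the parameter P of the proposed design, whose definition lies outside this section.

```python
def memory_words(n, p=2):
    """On-chip memory (in words) from the Table VI expressions for an N x N frame.
    p is the parameter P of the proposed design; p=2 is an assumption."""
    return {
        "Taghavi [23]": 5 * n ** 2,
        "A. Das [24]": 5 * n ** 2 + 5 * n,
        "Darji [26]": 4 * n ** 2 + 10 * n,
        "Proposed": 2 * (3 * n + 40 * p),
    }

if __name__ == "__main__":
    for name, words in memory_words(512).items():
        print(f"{name:14s} {words:>10d} words")
```

With these assumptions the earlier designs need on the order of a million words, because their
expressions grow with N^2, while the proposed expression grows only linearly with N.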


Table VI

Comparison of proposed 3-D DWT architecture with existing architectures (for 1-level)

Parameters                    Weeks [22]             Taghavi [23]   A. Das [24]       Darji [26]         Proposed
Memory requirement            6N^2 + 6l              5N^2           5N^2 + 5N         4N^2 + 10N         2*(3N + 40P)
Throughput/cycle              -                      1 result       2 results         4 results          8 results
Computing time for 2 frames   2N^2 + 3l/2            6N^2           3N^2              3N^2               N^2/2P
Latency                       2.5N^2 + 0.5l cycles                  2N^2 cycles       3N^2/2 cycles      12 cycles
Area                          -                      -              1825 slices       2490 slices        2852 slice LUTs
Operating frequency           200 MHz (ASIC)         -              321 MHz (FPGA)    91.87 MHz (FPGA)   265 MHz (FPGA)
Multipliers                   -                      -              Nil               30                 Nil
Adders                        61 MACs                -              78                48                 176
Filter bank                   l-length               D-9/7          D-9/7             D-9/7              D-9/7 (2-D) + Haar (1-D)


Table VII

Synthesis Results (Design Vision) Comparison of Proposed 3-D DWT architecture with existing

Parameters                Darji et al. [26]   Proposed
Combinational Area        61351 µm^2          526419 µm^2
Non Combinational Area    807223 µm^2         553078 µm^2
Total Cell Area           868574 µm^2         1079498 µm^2
Operating Voltage         1.98 V              1.2 V
Total Dynamic Power       179.75 mW           38.56 mW
Cell Leakage Power        46.87 µW            4.86 mW


VI. Conclusions

In this paper, we have proposed a memory-efficient and high-throughput architecture for a CS-based
low-complexity encoder. The proposed architecture is implemented on the 7z020clg484 FPGA target of the
Zynq family, and is also synthesized with Synopsys Design Vision for ASIC implementation. An efficient
design of the 2-D spatial processor and the 1-D temporal processor reduces the internal memory, latency,
CPD and complexity of the control unit, and increases the throughput. When compared with the existing
architectures, the proposed scheme shows higher performance at the cost of a slight increase in area.
The proposed encoder architecture is capable of computing 60 UHD (3840x2160) frames in a second. The
proposed architecture is also suitable for scalable video coding. In addition, the complexity of the
encoder is reduced to a great extent. The proposed encoder is considered to be suitable for applications
including satellite communication, wireless transmission and data compression by high speed cameras.
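
The 60 UHD frames per second figure can be put in perspective with a rough back-of-the-envelope
calculation. The sketch below only computes how many pixels per clock cycle must be consumed to sustain
that rate at the two reported clock frequencies; the function name is hypothetical, and relating the
result to the 8 results per cycle of Table VI assumes one result per input pixel, which is not stated
in the text.

```python
def required_pixels_per_cycle(width, height, fps, clock_hz):
    """Pixels that must be consumed per clock cycle to sustain the frame rate."""
    return (width * height * fps) / clock_hz

if __name__ == "__main__":
    for f_mhz in (158, 265):  # ASIC and FPGA clock frequencies reported in the paper
        need = required_pixels_per_cycle(3840, 2160, 60, f_mhz * 1e6)
        print(f"{f_mhz} MHz: {need:.2f} pixels per cycle needed")
```

At 158 MHz the requirement is about 3.15 pixels per cycle and at 265 MHz about 1.88, both below
8 per cycle under the stated assumption.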